Some Experiments with the Northamptonshire County Datasets

Anshuman Kanetkar

1 Introduction

This IPython notebook outlines some experiments with the static datasets on the Northamptonshire County Council website. It also steps you through using the DataFetcher helper class, written in Python, which provides an API to:

  • Navigate the dataset hierarchy,
  • Download data files from the website, and
  • Load them into memory for data analysis using Python data analysis tools.

First we need to import the third-party packages we will need. These are:

  • pandas - For spreadsheet-style data frames
  • matplotlib - To generate inline graphs. The %matplotlib magic command enables plotting in the current Notebook, and the 'inline' directive embeds graphs directly in the notebook.

In [30]:
import os
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import re

Then, we import the DataFetcher helper class from the analysis.datafetch.northants module.


In [31]:
from analysis.datafetch.northants import DataFetcher

2 Accessing the Data

The DataFetcher class provides methods to access the datasets provided by the Northamptonshire County Council. The data seems to be organized in the following hierarchy:

  • Themes : These are broad areas of interest under which the various datasets have been organized. Some prominent examples:

    • Population & Census
    • Education
    • Economy

    Each theme has a unique id which is used in organizing the namespace for the website. The Datafetcher uses this to navigate around the site's various parts.

  • Sub-themes : These are further refinements of the individual themes -- for instance, Age & Gender datasets within the Population & Census theme.

  • Dataviews : Each data collection is called a DataView -- all files with the same dataview id have the same layout (ie columns), though they may be collected over different regions.

  • geo ids & names : As mentioned before a DataView could have multiple datasets collected over different target areas, for instance, population data collected at the county, district or country level etc.

3 Initializing the DataFetcher API

First, we need to instantiate a DataFetcher object as follows:


In [39]:
fd = DataFetcher()

Then this newly created DataFetcher object needs to be configured before use -- either by loading a previously saved configuration, or by querying the website. The relevant methods for doing this are:

  • sync_metadata() : This method fetches the metadata from the county website. The metadata is needed to navigate throught the dataset hierarchy and to organize the data when it is downloaded.
  • save_metadata(file_name) : Saves the metadata in the given file in JSON format.
  • load_metadata(file_name) : This method loads the JSON-encoded metadata that we may have saved previously with save_metadata()

In [40]:
if os.path.exists('metadata.json'):
    fd.load_metadata('metadata.json')
else:
    fd.sync_metadata()
    fd.save_metadata('metadata.json')

4 Understanding the datasets

Before we can analyse the datsets, we need to know what the datasets contain and how they are organized.The datasets are stored in the CSV text format, and we need to know the layouts of these CSV files. The DataFetcher helper class provides methods to make this process a bit less painful :-) :

  • get_dataview_ids() : Returns a list of dataview ids, which are unique integers that identify each dataset. These ids can be used as arguments to the fetch_dataview_csv() method described below to fetch the CSV file/s for a dataview.
  • get_dataview_ids_with_details() : Returns a python dict keyed on dataview ids, with the value being another dict that describes the dataview. The attributes returned are:
    • id : The dataview's id
    • title : Tells us what the dataset is.
    • theme : what is the broad theme under which this dataset belongs.
    • subtheme : The sub-theme within the larger theme.
    • description : A more detailed description of the dataset's contents, collection methods etc.

As the the example below shows the dict returned by get_dataview_ids_with_details can be loaded directly into a pandas DataFrame.


In [33]:
dataviews = fd.get_dataview_ids_with_details()
dvdf = pd.DataFrame(dataviews).transpose().sort(['theme'])
dvdf


Out[33]:
description subtheme theme title
199 Residents who provide unpaid care to others by... ASC1: Carers Adult Social Care Census : Residents Providing Unpaid Care
54 Live Births, Fertility Rates, Infant Mortality... C&YP: Early Years and Children's Centres Children and Young People Live Births & Infant Death
112 C&YP: Children's Health Children and Young People Child Immunisations
114 C&YP: Children's Social Care Caseload Children and Young People Looked after children
115 C&YP: Children's Social Care Caseload Children and Young People Looked after Asylum Seekers children
163 Results of the Departments of Health and Educa... C&YP: Children's Health Children and Young People Childhood Obesity
191 % of all Foundation Stage pupils in schools (a... C&YP: School Achievement / Examination Results Children and Young People Achievement in Foundation Stage Profile (Acade...
192 % of those deemed "End of Key Stage 4" who ach... C&YP: School Achievement / Examination Results Children and Young People Achievement at Key Stage 4 GCSE
211 Young people (Under Age 24) claiming Job Seeke... C&YP: Youth Employment and Unemployment Children and Young People Youth Unemployment
230 Children in Low Income Families Local Measure ... C&YP: Deprivation Children and Young People Children in Low Income Families Local Measure ...
250 Characteristics of Young People 10-17 who offe... C&YP: Young People Who Offend Children and Young People Young People Who Offend : Characteristics
251 Outcomes/Disposals received by a young person ... C&YP: Young People Who Offend Children and Young People Young People Who Offend : Outcomes/Disposals (...
252 Outcomes/Disposals received by a young person ... C&YP: Young People Who Offend Children and Young People Young People Who Offend : Outcomes/Disposals (...
253 Outcomes/Disposals received by a young person ... C&YP: Young People Who Offend Children and Young People Young People Who Offend : Outcomes/Disposals (...
254 Outcomes/Disposals received by a young person ... C&YP: Young People Who Offend Children and Young People Young People Who Offend : Outcomes/Disposals (...
255 Outcomes/Disposals received by a young person ... C&YP: Young People Who Offend Children and Young People Young People Who Offend : Outcomes/Disposals (...
256 Outcomes/Disposals received by a young female ... C&YP: Young People Who Offend Children and Young People Young People Who Offend : Outcomes/Disposals (...
257 Outcomes/Disposals received by a young male pe... C&YP: Young People Who Offend Children and Young People Young People Who Offend : Outcomes/Disposals (...
258 % of young people receiving the four main cate... C&YP: Young People Who Offend Children and Young People Young People Who Offend : % Outcome Categories...
259 Outcomes/Disposals received by a young person ... C&YP: Young People Who Offend Children and Young People Young People Who Offend : Outcomes/Disposals (...
260 Average Number of Outcomes per young person wh... C&YP: Young People Who Offend Children and Young People Young People Who Offend : Average Number of Ou...
261 Outcomes/Disposals received by a young person ... C&YP: Young People Who Offend Children and Young People Young People Who Offend : Outcomes/Disposals (...
262 Outcomes/Disposals received by a young person ... C&YP: Young People Who Offend Children and Young People Young People Who Offend : Outcomes/Disposals (...
263 Outcomes/Disposals received by a young person ... C&YP: Young People Who Offend Children and Young People Young People Who Offend : Outcomes/Disposals (...
264 Outcomes/Disposals received by a young person ... C&YP: Young People Who Offend Children and Young People Young People Who Offend : Outcomes/Disposals (...
265 Outcomes/Disposals received by a young person ... C&YP: Young People Who Offend Children and Young People Young People Who Offend : Outcomes/Disposals (...
266 Outcomes/Disposals received by a young person ... C&YP: Young People Who Offend Children and Young People Young People Who Offend : Outcomes/Disposals (...
269 Achievement in Foundation Stage Profile (data ... C&YP: School Achievement / Examination Results Children and Young People Achievement in Foundation Stage Profile (data ...
270 Results of the Oral Health Survey (dental deca... C&YP: Children's Health Children and Young People Oral Health / Dental Decay Amongst 5 Year Olds
275 Population of children ages 0-4 years C&YP: Early Years and Children's Centres Children and Young People Early Years (0-4s) Population
... ... ... ... ...
209 The types of central heating available in hous... ENVL: Housing HousingProjections HousingMarket... Environment and Living Census : Household Central Heating
213 Main method of travelling to work (longest dis... ENVL: Vehicles Transport Infrastructure Environment and Living Census : Travel to Work Method
218 Flytipping Incidents reported by Districts/Bor... ENVL: Waste and Environmental/Green Issues Environment and Living Flytipping Incidents
227 Household waste collection and management (lan... ENVL: Waste and Environmental/Green Issues Environment and Living Household Waste, Recycling and Landfill
274 Bookstart is a national scheme which offers th... ENVL: Leisure Environment and Living Bookstart Scheme Statistics
293 An Active Library Card User is someone who has... ENVL: Leisure Environment and Living Library Service : Active Library Card Users
136 HEAW: Holding Pen (Unhooked JSNA May be Useful) Health and Well-being Risk Factors and Prevention
167 HEAW: Holding Pen (Unhooked JSNA May be Useful) Health and Well-being Older People Disease Prevalence
231 Self-assessed general health of all usual resi... HEAW: Adults Health and Well-being Census : General Health of All Residents 2011
68 POPC: Population Age Gender Population and Census Child Population 5 Year Age Bands
151 Population Indicators from the 2011 Census whi... POPC: Population Age Gender Population and Census Census : Population 5 Year Age Bands
169 Ethnicity statistics from 2011 Census which w... POPC: Ethnicity Nationality Religion Language Population and Census Census : Ethnicity
174 Main language of the household based on the co... POPC: Ethnicity Nationality Religion Language Population and Census Census : Household Language
178 POPC: Population Age Gender Population and Census Census : Population Density and Average Ages
179 For use with calculations (eg Per Head of Popu... POPC: Population Age Gender Population and Census Population Totals for Performance Calculations
180 This data view provides Period Life Expectancy... POPC: Population Age Gender Population and Census Life Expectancy
181 This view provides population data in the foll... POPC: Population Age Gender Population and Census Census & MYEs : Population in 3 Broad Age Band...
182 Stated Religion of Usual Residents at the time... POPC: Ethnicity Nationality Religion Language Population and Census Census : Religion
194 This view provides population data in the foll... POPC: Population Age Gender Population and Census Census & MYEs : Population in 5 Broad Age Band...
195 This data view provides population data in the... POPC: Population Age Gender Population and Census Census & MYEs : Population in 8 Broad Age Band...
196 This dataview provides population data in othe... POPC: Population Age Gender Population and Census Census & MYEs : Population in Other Adhoc Age ...
197 Marital Status of Persons Aged 16 and Over at ... POPC: Disability MaritalStatus SexualOrientation Population and Census Census : Marital Status 2011
200 Numbers and Percentage of Residents Aged 3 and... POPC: Ethnicity Nationality Religion Language Population and Census Census : Proficiency In English (Individual Re...
210 The language categories in this dataset refle... POPC: Ethnicity Nationality Religion Language Population and Census Census : Main Language (Individual Residents...
229 Census : Country of Birth (General/Top Level) POPC: Ethnicity Nationality Religion Language Population and Census Census : Country of Birth (General/Top Level)
245 This data series comes from two sources :\r\nC... POPC: Population Age Gender Population and Census Census & MYEs : Age By Single Year : Children ...
246 This data series comes from two sources :\r\nC... POPC: Population Age Gender Population and Census Census & MYEs : Age By Single Year : Older Peo...
272 Lone Parent Households with dependent children... POPC: Households Population and Census Census : Lone Parent Households by Gender
285 Total and % of households with dependent child... POPC: Households Population and Census Census : Households With Dependent Children
337 The ethnic group of persons aged 16-64 (workin... POPC: Ethnicity Nationality Religion Language Population and Census Census : Resident Population Aged 16-64 : Ethn...

114 rows × 4 columns

5 Reading & Analyzing the data

Once the DataFetcher object has been intiialized, and the dataset layout is known to us, we can locate and download the CSV files we need. The CSV's can then be directly loaded into pandas DataFrames for subsequent analysis. The DataFetcher class provides the following methods for downloading data from the server:

  • download_dataviews(dir_name, geo_name, verbose=False) : Fetches al the dataviews for the geographical unit specified geo_name (county, district etc.), and stores them under the dir_name folder, organized by theme & subtheme. The verbose flag, if turned on, causes the method to spew out some status info as it downloads the data.
  • download_all_dataviews(dir_name, verbose=False) : Fetches all the dataviews available on the county website and stores them under the dir_name folder, organized by theme & subtheme.
  • fetch_dataview_csv(dv_id, geo_name, verbose=False) : Fetches the dataview specified by dv_id for the specified geo_name and returns a stream handle to the server response -- this can be used to load up a pandas DataFrame, for instance, as shown in the example below:

In [34]:
fp1 = fd.fetch_dataview_csv(54, 'county')
fp2 = fd.fetch_dataview_csv(54, 'district')
fp3 = fd.fetch_dataview_csv(54, 'region')

infants_county = pd.read_csv(fp1)
infants_district = pd.read_csv(fp2)
infants_region = pd.read_csv(fp3)
infants_county


Out[34]:
"County" Name Live Births : Total(2005) Live Births : Total(2006) Live Births : Total(2007) Live Births : Total(2008) Live Births : Total(2009) Live Births : Total(2010) Live Births : Total(2011) Live Births : Total(2012) ... Live Births : % Not In Marriage : Sole Registration (GRP 2)(2012) Numerator.25 Denominator.25 Infant mortality rate(2006) Infant mortality rate(2007) Infant mortality rate(2008) Infant mortality rate(2009) Infant mortality rate(2010) Infant mortality rate(2011) Unnamed: 132
0 34 Northamptonshire 8268 8588 8990 9225 9054 9258 9229 9288 ... 5.7 530 9288 5 4.5 4.7 4.5 4.2 5 NaN
1 E East Midlands NaN NaN 52482 54192 53746 55232 55378 55645 ... 6.1 3368 55645 NaN NaN NaN 5.1 4.7 NaN NaN
2 064 England NaN NaN NaN 672809 671058 687007 688120 694241 ... 5.7 39647 694241 NaN NaN NaN 4.7 4.6 NaN NaN

3 rows × 133 columns

We can now use pandas and matplotlib to manipulate these DataFrames and visualize the results:


In [35]:
def get_filtered(df, col_regex, region):
    def get_filter_years(label):
        m = col_regex.match(label)
        if not m is None:
            return int(m.group(1))
        return None    
    columns = filter(lambda x : False if get_filter_years(x) is None else True, df.columns)
    years = map(get_filter_years, columns)
    filtered_df = df[df['Name'] == region][columns].transpose()
    filtered_df.index = years
    filtered_df.columns = [region]
    return filtered_df

In [36]:
# Modified From http://matplotlib.org/examples/api/barchart_demo.html

import re

reg_births = re.compile(r'^Live\ Births\ \:\ Total\((\d+)\).*$')
reg_deaths = re.compile(r'^Infant\ mortality\ rate\((\d+)\).*$')

def make_bar_chart(df, col_regex, ax, width, title, ylabel):
    north_df = get_filtered(df, col_regex, 'Northamptonshire')
    north_bar = ax.bar(north_df.index, north_df['Northamptonshire'], width, color='r')

    midlands_df = get_filtered(df, col_regex, 'East Midlands')
    midlands_xrange = map(lambda x : x + width, north_df.index)
    midlands_bar = ax.bar(midlands_xrange, midlands_df['East Midlands'], width, color='g')

    england_df = get_filtered(df, col_regex, 'England')
    england_xrange = map(lambda x : x + width, midlands_xrange)
    england_bar = ax.bar(england_xrange, england_df['England'], width, color='b')

    # add some text for labels, title and axes ticks
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.set_xticks(map(lambda x : x + 1.5 * width, north_df.index))
    ax.set_xticklabels( map( lambda x : str(x), north_df.index ) )

    # Legends
    ax.legend( (north_bar[0], midlands_bar[0], england_bar[0]), ('Northamptonshire', 'East Midlands', 'England') )

In [37]:
fig, ax = plt.subplots()
width = 0.5       # the width of the bars

fig.set_size_inches(16, 8)

make_bar_chart(df=infants_county, 
               col_regex=reg_births, 
               ax=ax, 
               width=width,
               title='Live Births in Northants compared to East-Mid region & Eng', 
               ylabel='Live Births')

ax


Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7667650>

In [38]:
fig, ax = plt.subplots()
width=0.2
fig.set_size_inches(12, 8)
make_bar_chart(df=infants_county, 
               col_regex=reg_deaths, 
               ax=ax, 
               width=width,
               title='Infant Mortality in Northants compared to East-Mid region & Eng', 
               ylabel='Infant Mortality (%)')

ax


Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7987c50>

6 Conclusions

  • The demographics data provided by the Northamptonshire County provide a useful starting point for understanding the community in question. The way the data is organized on the website is of special interest to us, as a guide on how to organize the datasets we collect data from our investigations. This is important as the datasets we collect would originate from multiple datasources & social media platforms (facebook, twitter, linkedin etc.), and would need to be collated across different facets (eg professional info, activities, education etc.). So, some prior thought on organizing these datasets would benefit us in the long run.

  • The DataFetcher API provides a programmatic interface for accessing the datasets on the council website and downloading them to local storage. Going ahead, the DataFetcher (or something equivalent) could be integrated with social-media crawlers and database backend storage (e.g. a document store such as mongodb) to serve as a layer of storage abstraction for any analysis/ML tools we may need to build.

  • IPython notebooks can be very useful for quickly protyping any data-science ideas or tricks we may want to exchange, and also as a tool for literate programming. Although this notebook was authored in Python, the Jupyter/IPython project supports other data-centric languages such as R & Julia as well.


In [22]:


In [ ]: